Phylogeny-aware identification and correction of taxonomically mislabeled sequences
Authors
Abstract
Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences ('mislabels') using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa.
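The sensitivity and precision figures quoted above can be read with the standard definitions: sensitivity is the fraction of true mislabels that were flagged, and precision is the fraction of flagged sequences that are true mislabels. The following is a minimal sketch of that bookkeeping only. It is not part of SATIVA, and the function name `score_detection` and the toy sequence IDs are hypothetical.

```python
def score_detection(true_mislabels, flagged):
    """Return (sensitivity, precision) for a set of flagged sequence IDs.

    true_mislabels: IDs known to be mislabeled (e.g. introduced in a simulation).
    flagged: IDs reported as mislabeled by some detection method.
    """
    true_mislabels = set(true_mislabels)
    flagged = set(flagged)
    true_positives = len(true_mislabels & flagged)
    # Sensitivity (recall): share of real mislabels that were found.
    sensitivity = true_positives / len(true_mislabels) if true_mislabels else 0.0
    # Precision: share of flagged sequences that are real mislabels.
    precision = true_positives / len(flagged) if flagged else 0.0
    return sensitivity, precision


# Hypothetical toy data, not taken from the paper.
truth = {"seq1", "seq2", "seq3", "seq4"}
reported = {"seq1", "seq2", "seq3", "seq9"}
sens, prec = score_detection(truth, reported)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In a simulation such as the one described, the set of true mislabels is known by construction, which is what makes both quantities computable.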
Similar resources
Phylogeny of gazelles in some islands of Iran based on mtDNA sequences: Species identification and implications for conservation
Different species of gazelles are among the most endangered mammals on the Asian steppes and occur in the central, southern and northwestern regions of Iran. Previous conservation efforts in this region have been incomplete due to confusion about the phylogenetic relationships among various populations. As a result, different conservation programs such as ex-situ breeding and transfer of captive...
Morphology and Phylogeny of Scrippsiella trochoidea (Dinophyceae) a potentially harmful bloom forming species isolated from the sediments of Iran’s south coast
Phytoplankton cells and resting cysts of the species Scrippsiella trochoidea are regular and dominant components of the dinoflagellate flora of coastal marine waters and sediments around the world. This species commonly forms harmful blooms in coastal waters. In this study, cysts of S. trochoidea were isolated for the first time from the sediments of the southeast coast of Iran. Five str...
Molecular phylogeny of some avian species using Cytochrome b gene sequence analysis
Reliable identification and differentiation of avian species is a vital step in conservation, taxonomic, forensic, legal and other ornithological work. Therefore, this study applied a molecular approach to identify some avian species, i.e. Chicken (Gallus gallus), Muscovy duck (Cairina moschata), Japanese quail (Coturnix japonica), Laughing dove (Streptopelia senegale...
The Molecular Identification of four Species of Gastropoda on Rocky Shores of the Persian Gulf
In this study, four species of Gastropoda from the northern rocky coastal zones of the Persian Gulf were identified molecularly for the first time, based on the mitochondrial genes COI and 16S rRNA, during 2013 and 2014. Planaxis sulcatus, Cerithidea cingulata, Siphonaria savignyi and Onchidium peronii were identified. After morphological identification, DNA extractio...
A DNA sequence-based identification checklist for Taiwanese chondrichthyans.
In an effort to establish a DNA sequence based checklist of the highly diverse chondrichthyan fauna of Taiwan, we sequenced the mitochondrial NADH2 gene of 257 freshly sampled specimens of Taiwanese chondrichthyans, which were identified to species level by experts in the field. The newly generated sequences were analysed in the context of an already published phylogeny based on NADH2 sequences...